Abstract

Decisions about pharmacotherapy are taken by medical doctors and authorities on the basis of comparative
studies on the use of medications. In studies on fertility treatments in particular, methodological quality is
of utmost importance for the application of evidence-based medicine and systematic reviews. Nevertheless, flaws
and omissions appear quite regularly in these types of studies. The current study aims to present an overview of
some of the typical statistical flaws, illustrated by a number of example studies that have been published in
peer-reviewed journals. Based on an investigation of eleven randomly selected studies on fertility treatments
with cryopreservation, it appeared that the methodological quality of these studies often did not fulfil the
required statistical criteria. The following statistical flaws were identified: flaws in study design, patient selection,
units of analysis and the definition of the primary endpoints. Other errors were found in p-value and
power calculations and in critical p-value definitions. These study results should therefore be interpreted,
and used in a meta-analysis, with care.

Key words: Statistics in infertility studies, subfertility, methodological study requirements, cryopreservation. 



Introduction 

The Cochrane Collaboration (Cochrane, 1989) was set up to provide the pharmaceutical industry,
health care specialists and patients with high-quality and independent information on the impact
of health care interventions, by means of systematic reviews of randomised studies. Two phases of
the systematic evaluation, namely the quality assessment of study methods and the statistical
pooling of results (so-called 'meta-analysis'), depend on the quality of the study reports. Concern
about the quality of study reports and the risk of bias has led to consolidated standards (CONSORT;
Consolidated Standards of Reporting Trials (http://www.consort-statement.org/)), which have been
adopted by many journals and which their publications must satisfy. Although CONSORT offers a
useful framework, studies on fertility treatment impose additional requirements on the design of
the study and on the analysis of the study results. This study gives an overview of typical
incorrect statistical analyses in recent fertility treatment studies with cryopreservation. We
investigate whether statistical flaws of the following types were made: flaws in study design,
patient selection, units of analysis or the definition of the primary endpoints, p-value errors,
and flaws in power calculations, minimal important differences or critical p-value definitions.
Although previous studies (Barlow, 2003; Dickey, 2003; Salim Daya, 2003; Vail and Gardener, 2003;
Arce et al., 2005) have already addressed such flaws, the current study focuses on specific
studies, namely those in which frozen-thawed embryo transfers were included in the study objectives.

Materials and Methods

The random selection of publications was based on a literature search for peer-reviewed
publications in Thomson Reuters Web of Knowledge containing the following keywords: fertility,
frozen embryo transfer, frozen embryo replacement, frozen embryo cycle, FET and cryopreservation
and pregnancy, covering a 14-year period (1995 to 2009). Moreover, the authors hand-searched
conference abstracts of major proceedings (e.g. ESHRE and ASRM) as well as the reference lists of
selected papers. Based on the titles and abstracts of the search results, the authors selected a
set of 11 studies for the current evaluation.

All selected studies were published in international scientific journals, including Human
Reproduction, Fertility and Sterility, and Molecular and Cellular Endocrinology. Each study was
examined for the following typical statistical flaws:

1. Study design 

Double blind studies 

If two treatments may differ in effect or in route of administration, then the only correct study
design is a double-blind, double-placebo or placebo-controlled study. However, from a
medical-ethical point of view, a double-blind placebo-controlled study often has limited
feasibility. Moreover, adapting the appearance of treatments is generally technically difficult
and expensive, and therefore less applicable. A single-blind study is then the most obvious
choice, but it means that in this case no definite conclusions can be drawn. In practice, in this
type of study it is nearly impossible to ensure that the researcher remains blind to all important
parameters of the research, because a patient can reveal at any time which treatment she is
undergoing purely through the description and appearance of the medication used. Other types of
design flaws, e.g. lack of superiority, equivalence and non-inferiority considerations, and failure
to avoid co-intervention or treatment bias (when some subjects receive other, unaccounted-for
interventions at the same time as the study treatment), occur frequently as well and have been
discussed in depth by Daya (2006). Other, but less suitable, types of study for the investigation
of fertility treatments are:

- prospective randomised study 

- retrospective studies (looking back in time). 

Crossover studies 

In a crossover study the participants are divided into two groups. The first group receives
treatment A first, followed by treatment B, whereas the second group is treated in the reverse
order. An advantage of this design is that the minimum number of participants required to detect
an effect can remain relatively small. Studies of fertility treatments are special in the sense
that the treatment stops once success (a pregnancy) has been reached. A direct consequence of such
an extreme form of 'carry-over' is that crossover studies are unsuitable for fertility treatments
(Senn, 1993). In other fields, however, crossover studies may well be suitable.

ITT principle 

In the intention-to-treat (ITT) analysis, patients are analysed in the arm to which they were
initially allocated, and drop-outs are included in the count. If the persons who drop out are not
counted, this generally leads to an overestimate of the treatment effect, because one then
compares only those patients who have continued the treatment (and who, for example, did not have
too many side effects). To include drop-outs, one frequently uses a technique in which the last
measured value is counted as the endpoint (last observation carried forward).
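
The effect of leaving drop-outs out of the denominator can be illustrated with a small calculation. The sketch below uses purely hypothetical counts (they are not taken from any of the reviewed studies) and assumes that women who drop out are counted as not pregnant in the ITT analysis; the function name `success_rates` is introduced here for illustration only.

```python
def success_rates(randomised: int, dropouts: int, pregnancies: int) -> tuple[float, float]:
    """Return (ITT rate, completers-only rate) for one study arm.

    Assumption: drop-outs are counted as 'not pregnant' in the ITT analysis.
    """
    completers = randomised - dropouts
    itt_rate = pregnancies / randomised          # denominator: all randomised women
    completer_rate = pregnancies / completers    # denominator: only women who continued
    return itt_rate, completer_rate


# Hypothetical arms: same number of pregnancies, but more drop-outs on the new treatment.
new_itt, new_completers = success_rates(randomised=100, dropouts=20, pregnancies=24)
std_itt, std_completers = success_rates(randomised=100, dropouts=5, pregnancies=24)

print(f"ITT difference:             {new_itt - std_itt:.3f}")
print(f"Completers-only difference: {new_completers - std_completers:.3f}")
```

In this hypothetical case the ITT analysis shows no difference between the arms, whereas the completers-only analysis suggests an advantage for the arm with more drop-outs.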

2. Patient selection 

The most commonly used selection criteria in fertility treatment studies are the age of the
female patient, her ovulatory status and her body mass index (BMI). The number of previous IVF
attempts is also an important selection criterion. Since patients are seen at all kinds of stages
of treatment, this can be accepted in medical practice. It is known, however, that the success
rate of IVF decreases with the number of earlier attempts (Templeton et al., 1996); patients with
the most favourable prognosis become pregnant more rapidly.

This can bias the results if the two groups are not balanced with respect to the number of
previous IVF attempts. These types of important prognostic confounders must be taken into account
in the study design, by means of stratification when assigning the treatment to the patient (by
randomization of the medicines within subgroups with the same number of preceding IVF/ICSI
attempts), or in the statistical analysis. Apart from stratification prior to randomization,
balance of prognostic factors can also be achieved by the technique of minimization. At the point
of assignment of each new patient to one of a number of treatments, minimization involves
calculating for each treatment group the comparative degree of imbalance that would occur if the
patient were assigned to that group (McEntegart, 2003).
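
The idea of minimization can be sketched in a few lines of code. The example below is a simplified illustration and not the procedure of McEntegart (2003): as an assumption, imbalance is measured as the sum, over the prognostic factors, of the range of the per-arm counts of patients who share the new patient's factor level, and ties are broken at random. The class name `Minimizer` and the factor names are hypothetical.

```python
import random
from collections import defaultdict


class Minimizer:
    """Simplified minimization: assign each patient to the arm giving least imbalance."""

    def __init__(self, arms, factors, seed=None):
        self.arms = list(arms)
        self.factors = list(factors)
        self.rng = random.Random(seed)
        # counts[arm][(factor, level)] = patients already in that arm with that level
        self.counts = {arm: defaultdict(int) for arm in self.arms}

    def imbalance_if_assigned(self, patient, candidate):
        """Total imbalance (sum of count ranges) if the patient joined `candidate`."""
        total = 0
        for factor in self.factors:
            key = (factor, patient[factor])
            level_counts = [
                self.counts[arm][key] + (1 if arm == candidate else 0)
                for arm in self.arms
            ]
            total += max(level_counts) - min(level_counts)
        return total

    def assign(self, patient):
        """Pick the arm with the smallest resulting imbalance; break ties at random."""
        scores = {arm: self.imbalance_if_assigned(patient, arm) for arm in self.arms}
        best = min(scores.values())
        chosen = self.rng.choice([arm for arm in self.arms if scores[arm] == best])
        for factor in self.factors:
            self.counts[chosen][(factor, patient[factor])] += 1
        return chosen


# Hypothetical usage with two prognostic factors.
minimizer = Minimizer(["A", "B"], ["previous_attempts", "age_group"], seed=42)
print(minimizer.assign({"previous_attempts": 0, "age_group": "<35"}))
print(minimizer.assign({"previous_attempts": 0, "age_group": "<35"}))
```

In practice a random element is often added to the assignment so that allocation stays unpredictable; that refinement is omitted here for brevity.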






F, V & V in ObGyn 



3. 'Unit of analysis' flaws

Simple tests for comparing groups, such as the t-test or Mann-Whitney test for continuous data and
the chi-squared (χ²) or Fisher's test for categorical data, require that the observations are
statistically independent. When participants are allocated to several arms of a study, this will
generally mean that only one observation per patient may be included in such an analysis. The
hierarchical character of subfertility data with, for example, several oocytes, embryos and
implantations per treatment cycle, and several treatment cycles per woman, can quickly lead to
unit of analysis flaws. Use of several observations per woman leads to unpredictable biases in the
estimate of the difference in treatment effect. It can lead to unjustifiably narrow confidence
intervals and low p-values.

4. Primary endpoints in the study 

The primary outcome of subfertility studies should preferably be live birth, expressed for
example as the baby take-home rate and the cumulative baby take-home rate. Side effects in
fertility treatments can lead to flaws in the statistical analysis. Ovarian hyperstimulation
syndrome is a typical side effect, which can be considered as a treatment error. Other important
side effects, such as ectopic pregnancy and miscarriage, only appear after the partial technical
success of achieving a pregnancy. These events have, however, two methodological consequences. In
the first place, it is usual to report pregnancy-related side effects as a percentage of the
pregnancies, instead of as a percentage of the randomised women. Such an approach, however, loses
the advantages of randomised comparison. It can also be misleading, because it is possible to have
a higher percentage of miscarriages per pregnancy in the group with the lowest percentage of
miscarriages per woman.
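
A small numerical example shows how the choice of denominator can reverse the comparison. The counts below are hypothetical and chosen only to illustrate the point; the class `Arm` and its fields are names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    name: str
    randomised_women: int
    pregnancies: int
    miscarriages: int

    def miscarriage_per_pregnancy(self) -> float:
        # Denominator: pregnancies only (loses the randomised comparison)
        return self.miscarriages / self.pregnancies

    def miscarriage_per_woman(self) -> float:
        # Denominator: all randomised women
        return self.miscarriages / self.randomised_women


# Hypothetical arms of a randomised study
arm_a = Arm("A", randomised_women=100, pregnancies=10, miscarriages=3)
arm_b = Arm("B", randomised_women=100, pregnancies=40, miscarriages=8)

for arm in (arm_a, arm_b):
    print(
        f"Arm {arm.name}: "
        f"{arm.miscarriage_per_pregnancy():.0%} per pregnancy, "
        f"{arm.miscarriage_per_woman():.0%} per randomised woman"
    )
```

Here arm A has the higher miscarriage percentage per pregnancy but the lower percentage per randomised woman, simply because far fewer women in arm A became pregnant.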

5. p-Value calculation 

The p-value or exceedance probability (of a given sample outcome) is the probability, under the
distribution implied by the null hypothesis, that the observed value of the test statistic is
equalled or exceeded (left-tailed, right-tailed or two-tailed). The p-value indicates how extreme
the observed value of the test statistic is in the distribution under the null hypothesis: the
smaller the p-value, the more extreme the outcome. In practice, values of 5% and 1% are used as
thresholds. In fertility treatment studies, p-values are mostly calculated with the
Cochran-Mantel-Haenszel test or the Z-test. The Cochran-Mantel-Haenszel test is a χ² test that
examines whether an association between two variables is present after controlling for other
variables. The test measures the strength of this association. A Z-test is mostly applied to
proportions, where the test statistic is assumed to follow a normal distribution, which generally
holds well for large sample sizes (Bland, 2000). Calculated p-values are smaller or larger than
the significance level, but practically never exactly equal to it. Nevertheless, published studies
regularly show p-values exactly equal to 5%. In most such cases the calculated p-value has been
rounded down.
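
As an illustration of the Z-test on proportions, the sketch below computes a two-sided p-value for the difference between two independent proportions, using a pooled standard error and the normal approximation. It assumes one observation per woman and reasonably large groups; the function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both groups share the same proportion."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis of no difference
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical example: 30/100 versus 20/100 live births
z, p_value = two_proportion_z_test(30, 100, 20, 100)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

Reporting the p-value to three or four decimals, rather than as "p = 0.05", avoids the rounding problem described above.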

6. Power calculations 

Power calculations determine the minimum sample size required to detect statistically significant
differences between groups in a study. The sample size depends on the effect size that one
expects, as well as on the required probability of finding an effect that is actually present in
the population (the power). In fertility treatment studies, power calculations are frequently
carried out retrospectively, which is incorrect, or are absent altogether.
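
A prospective power calculation for comparing two proportions can be sketched as follows. The code uses the common normal-approximation formula without continuity correction; the default two-sided alpha of 0.05 and power of 0.80 are conventional choices assumed for illustration, and the success rates in the example are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients needed per arm to detect p1 versus p2 (two-sided test, equal arms)."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ; the expected difference drives the sample size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the type I error
    z_beta = NormalDist().inv_cdf(power)            # critical value for the type II error
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Hypothetical example: expected success rates of 20% and 30%
print(sample_size_per_group(0.20, 0.30))
```

Because the expected difference appears squared in the denominator, halving the difference to be detected roughly quadruples the required number of patients, which is why this calculation has to be made before the study starts.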

7. Critical p-values in sequential studies 

Sequential studies are common practice in biomedical research for ethical, administrative and
economic reasons. In such studies, statistical hypotheses are tested repeatedly, each time a new
group of observations has been completed.

The analysis of the results takes place before the final number of experimental units has been
reached. If this happens in an uncontrolled manner, the term 'peeking' is used. A statistical
penalty must be applied in such a case, because if enough interim analyses are carried out, and if
the result of the statistical test lies on the border between significant and not significant,
one of the analyses will eventually, and wrongly, result in p < 0.05. Sequential statistical
methods include not only finding suitable methods for providing critical p-values at each interim
analysis, but also developing efficient inferential procedures for secondary analyses, such as
parameter estimates, confidence interval calculations, etc. Techniques for sequential analysis,
where data continue to accumulate, are available in the literature. In practice, however, such
analyses are generally carried out without the necessary adjustments for the type I error. A
simple (although severe) method for correcting the critical p-value is the use of the Sidak
inequality (Sidak, 1967). This results in an adapted critical p-value given by the formula
$1 - (1 - p)^{1/k}$, where k is the number of interim analyses and p is the nominal critical
p-value (generally 0.05). The correction is severe because it is designed for several comparisons
that are entirely independent. Successive interim analyses, however, are mostly not completely
independent of each other, but only to a certain degree. For this reason the Sidak adaptation
gives a lower bound for the critical p-value. The Armitage-McPherson adaptation (Armitage et al.,
1969) is a less strict adaptation than the Sidak correction (Sidak, 1967) (Table 1).

STATISTICAL FLAWS RAISE DOUBTS ON CONCLUSIONS - VAN GELDER AND NUS

Table 1. Critical p-values in sequential studies in fertility studies including frozen-thawed embryo transfers.

Number of          Nominal critical   Corrected critical p-value   Corrected critical p-value
interim analyses   p-value            according to Sidak (1967)    according to Armitage-McPherson (1969)

1                  0.05               0.05                         0.05
2                  0.05               0.025                        0.03
5                  0.05               0.01                         0.016
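
The Sidak correction is easy to compute directly. The sketch below evaluates the standard form of the correction, $1 - (1 - p)^{1/k}$, for a nominal critical p-value p and k interim analyses; the function name is introduced here for illustration, and the formula assumes fully independent analyses, which is why it is conservative for interim looks at accumulating data.

```python
def sidak_critical_p(nominal_p: float, k: int) -> float:
    """Per-analysis critical p-value keeping the overall level at nominal_p over k independent tests."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - nominal_p) ** (1.0 / k)


for k in (1, 2, 5):
    print(f"k = {k}: corrected critical p-value = {sidak_critical_p(0.05, k):.4f}")
```

For a nominal level of 0.05 this gives approximately 0.025 for two analyses and approximately 0.010 for five analyses.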

8. Minimal important differences (MID) 

Another methodological flaw is the absence of a definition of the 'minimal important difference',
being the smallest benefit of treatment that would result in clinicians recommending it to their
patients. The MID is necessary to calculate the sample size for randomized clinical trials, but
its chosen value is often arbitrary. Power calculations can be performed to detect a statistical
difference given a defined alpha error (incorrectly accepting that a difference exists between the
two treatments) and beta error (incorrectly accepting that no difference exists between the two
treatments), but an answer to the question of how big the difference should really be to be
clinically relevant is frequently missing. It is very important to define this before embarking on
a trial and on sophisticated statistics. Van Walraven et al. (1999) investigated the
practicability of surveying physicians to elicit the MID for clinical trial sample-size
calculation.

Results 

Table 2 presents a detailed overview of the analysis for statistical flaws of 11 randomly selected
fertility studies that include frozen-thawed embryo transfer analysis. Most of the 11 selected
studies contain flaws in patient selection: patients were not randomly selected or they were
incorrectly followed over a period of several cycles. In some studies, flaws in primary endpoints
were observed. Other errors are reported as purely statistical flaws (unit of analysis errors,
misapplication of crossover design, technical errors in power or significance calculations),
issues related to clinical preference (eligibility criteria, choices of primary outcome) and study
design issues (blinding, ITT, randomisation). All studies deal incorrectly with the units of
analysis (namely, cycles are used instead of patients). Unsuitable retrospective designs are also
used in the majority of the studies. Furthermore, in only one study had power calculations been
carried out beforehand. In none of the studies investigated were minimal important differences
presented, which is a major flaw whose importance is underestimated. Finally, adjustments of the
critical p-value in sequential studies were missing.

Discussion 

Particularly in fertility treatment studies, methodological quality is very important for the
application of evidence-based medicine and systematic reviews. Nevertheless, errors and omissions
occur regularly in these studies. In this article an overview has been given of the most striking
statistical flaws. The seriousness and the impact of these flaws differ per study. One study even
wrongly rejected the null hypothesis with respect to the most important reported parameter. The
flaws that were identified therefore cast doubt on the conclusions of these specific studies. It
is of utmost importance that, when studies are being registered and set up, the primary and
secondary endpoints are fixed. Power calculations must also be discussed. The correct application
of medical statistics in reproduction studies is very important, but unfortunately in practice it
is not always conducted well. Although it is not incorrect to publish a case series or comparative
cohort study of an intervention, the design is not as strong as that of a randomized controlled
trial (RCT), and such a study cannot be interpreted with the same causal inference. The RCT is the
gold standard for evaluating the effectiveness/efficacy of interventions. All other study designs,
in which no random sequence generation is used, are at risk of bias, leading us away from drawing
correct conclusions from the studies.

Previous authors, such as Barlow (2003), Dickey (2003), Salim Daya (2003), Vail and Gardener
(2003) and Arce et al. (2005), have addressed some of the issues mentioned above. The current
study has reconfirmed that statistical flaws still occur in recent studies. Methodological flaws
in the study design compromise the internal and external validity of the results and conclusions.
A plea is made to implement the publication of peer-reviewed study protocols before embarking on a
trial, in order to increase the quality of studies on fertility treatment. In order for studies to
be analysed correctly, researchers should receive postgraduate courses on statistical theory and
on the design and analysis of studies. Clinical trials should be registered, at the time of their
inception, with one of the trial registers recognised by the International Committee of Medical
Journal Editors. This registration process should include defining the type of statistical
analysis to be used for that specific trial as well as the registration of the power analysis.
Journals are advised to focus more on the statistical soundness of submitted papers, and should
perhaps by default send each accepted manuscript to a specialised reviewer for analysis of the
statistical soundness of that specific study. Reviewers should be trained to identify possible
errors in the study design and statistical analysis in submitted manuscripts. A combination of
such actions might help to reduce the occurrence of statistical flaws in research and will hence
result in the publication of solid research papers.



A Call for action 

As an 'exercise', we challenge the readers of this journal to identify possible flaws in the study design and
statistical analysis of the paper of Zhang et al. (2006), which investigates the effect of traditional Chinese herbs
combined with low-dose human menopausal gonadotropin applied in frozen-thawed embryo transfer
(Chin J Integr Med. 2006;12, 244-49). This paper has been randomly chosen and fits the above purpose.
In the next issue of this journal, the possible flaws associated with this 12th paper will be presented by the
authors and can be compared with the readers' observations.